ABSTRACT

This paper presents an introduction to language engineering 
software, especially for computerized language and text corpora. The focus of 
the paper is on small and relatively independent pieces of software designed 
for specific, often low-level language analysis tasks, and on tools in the 
public domain. Discussion begins with the application of standards to 
language corpora, and the role of information technology in promoting 
standardization. Current international standards and statistical tools are 
then examined briefly, and computational linguistic tools (morphological 
analyzers, implementation of formalisms, lexicon development environments) 
are noted. Problems in using public domain tools, and prospects for their 
resolution, are also discussed. A list of related World Wide Web resources is 
included. Contains 15 references. (MSE) 



1. Introduction

This paper gives an introduction to language engineering software, 
especially as it relates to computerised textual corpora. The focus of the paper 
is on language engineering tools, i.e. relatively small and independent pieces 
of software, meant for a particular, usually low-level task. Other, larger and
more complex systems will be mentioned as well, as long as they are
connected to the processing of textual material, in particular to corpus
production and, to some extent, its utilisation. The paper does not discuss
tools dealing with speech production or recognition, although some of the
corpus tools are relevant for producing speech corpora as well.

Even though the focus of this article will be on public domain tools, some 
software will be mentioned that does not, strictly speaking, belong in this 
category. There is a substantial variety of conditions that authors impose on
their software, with proprietary, commercial products at one end and freely
available public domain software, which can be used for any purpose, at the
other end of the spectrum. Quite a few interesting linguistic tools fall
somewhere between these two extremes, the most common conditions being
that the software may be freely used for non-profit purposes only, or that it
falls under the GNU General Public License, which essentially forbids such
software from being incorporated into proprietary programs. Furthermore,
some authors request an explicit license to be signed before releasing their 
software. Nevertheless, such systems, even though not in the public domain, 
can be used by academic users and, in certain cases, can be of substantial use 
in an industrial environment as well. So, for example, if the software publicly 
released for academic use only is of sufficient interest, an arrangement can 
usually be made with the authors, or, if the software is a GNU library, it can 
still be used by proprietary software, as long as it does so in accordance with 
the GNU library general public license. 

Using public domain tools has several obvious benefits, which are probably 
greatest for smaller research teams. These often lack funds to buy proprietary 
software or manpower for in-house development. Besides the obvious benefit
of being free, public domain tools allow for exploring a particular
technology; even if a tool is not exactly what is required, the source code
(where available) can be modified to suit particular needs. With public
domain linguistic software that incorporates language-specific resources,
these resources (e.g., a morphological rule-base) can also be reused locally.
Of course, the problems associated with using public domain tools should not
be underestimated. These, along with some future prospects for their
resolution, will be discussed in the concluding section. Finally, for many
tasks, in particular for corpus work, commercial software is simply not
available. Therefore, the
available options are narrowed to using (and possibly modifying) public 
domain tools, or re-inventing the wheel by in-house development of the 
software. 

The rest of the article is organised as follows: first, some introductory
remarks are given on corpora and their connection with standards and
technological advances; next, the Unix platform, as the preferred development
environment, is discussed; this is followed by a section on SGML and
statistical tools and a section on computational linguistic tools. Finally,
some drawbacks to using public domain tools are given, together with recent
efforts in this field, concluding with a list of Web sites relevant to
language engineering tools.



2. Corpora, Standards and the Internet 

Recent years have seen a steep growth of computer corpora which have 
increased in size, number, and variety. Two of the more impressive examples 
of this trend are the hundred million annotated words of the British National 
Corpus and the hundred CD-ROMs of language resources offered by the 
Linguistic Data Consortium, which include annotated spoken corpora and 
multilingual corpora, both from a variety of sources. 

The increased ability to produce and disseminate corpora is to a large 
extent due to technological advances. The dropping price and large capacities 
of mass storage media mean that large and heavily annotated corpora can be 
easily stored on-line. The growth in electronic communications, especially
the success of the World Wide Web, enables information on language resources
or the resources themselves to be offered and accessed globally. In addition,
the growing acceptance of certain standards that enable the exchange and
platform independence of corpus encodings has also had an important impact on
corpus availability and reuse. In particular, the Text Encoding Initiative
guidelines (Sperberg-McQueen & Burnard, 1994; Ide & Véronis, 1995), which
adopt the ISO standard SGML (Goldfarb, 1990) as their markup (meta)language,
are a significant contribution to the standardisation effort in this area.

This ease of availability and adoption of standards is important not only 
for corpora themselves, but also for software that helps in producing and, to 
a lesser extent, utilising these corpora. Internet connections mean that such 
tools as are offered to the public can be easily downloaded or, in some cases,
demonstrated, while the increasing adoption of standards minimises
portability and interface problems.



3. The Unix Platform

Although publicly available software does exist for PCs and Macs, it is 
Unix that is practically based on the notion of free software. This is due as 
much to the developmental history of Unix as to the GNU initiative of the 
Free Software Foundation (FSF). Although GNU software is not public 
domain, it can be used and modified freely, as long as it is not incorporated 
into proprietary systems. The utility of Unix as a development environment
is due to it being a very powerful system, with an “amorphous” structure that
imposes relatively few constraints on program developers. For these reasons,
it will be primarily Unix software that will be discussed in this article. 

It should be noted, however, that it is also because of its power and 
reliance on free software that Unix is in many ways a troublesome system to 
use and maintain. Furthermore, Unix comes in a thousand and one flavours,
depending on the exact platform in use (e.g., Solaris for Sun SPARCs, Linux
for PCs, Irix for SGIs, etc.), which often makes the installation of new
programs a difficult undertaking.

I list next some software which runs on Unix (but often on other
platforms as well) and can be of use in language engineering. First, Unix
offers a variety of tools that do a specific job well, for example string or
regular expression searching (grep) or sorting (sort). If programming
languages can be thought of as “tools”, then Unix offers a large selection
that is of use in writing programs for language processing. The general
purpose programming language which is currently, and probably for a while to
come, the de facto standard is ANSI C and its object-oriented extension C++.
Unix also offers a number of languages that are particularly suited for
string processing, such as sed (Dougherty, 1991) and awk (Aho et al., 1988).
Perl deserves particular mention (Wall & Schwartz, 1991): it is suitable as
much for writing short, throwaway programs as for complex conversion tasks.
Finally, the GNU editor, Emacs, must be mentioned which, although, as with
most things in Unix, it has a long learning curve, does offer very powerful
functionality and is freely extensible in its variant of Lisp.
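
As a rough illustration of the kind of job such small tools perform, the
following Python sketch combines a regular expression search in the spirit of
grep with a sorted frequency count in the spirit of sort. It is not a
reimplementation of either utility; the function names and the sample lines
are invented for this example.

```python
import re
from collections import Counter

def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

def frequency_list(lines):
    """Count lower-cased word forms; sort by falling frequency, then alphabetically."""
    counts = Counter(
        word.lower()
        for line in lines
        for word in re.findall(r"[^\W\d_]+", line)   # runs of letters only
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical sample data, standing in for lines of a corpus file.
sample = [
    "The corpus is annotated.",
    "The tagger tags the corpus.",
    "Tools sort and search text.",
]
```

Calling grep(r"tag", sample) selects the lines mentioning tagging, and
frequency_list(sample) yields a word list ordered by frequency.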



4. SGML and Statistical Tools 

The view of the corpus building process adopted here revolves around 
(presumably TEI conformant) SGML as the underlying data representation 
format. The evolution of a corpus is seen as composed of three stages. The 
corpus texts will usually be obtained in some sort of machine readable legacy
format (e.g., an ASCII representation, RTF from Word files, etc.) and are
first up-translated into a corpus-wide encoding format, i.e., SGML. This
bibliographically and, to an extent, structurally encoded corpus is then
usually additionally annotated (in SGML) in a number of ways, for example, for
part-of-speech or multilingual alignment. Finally, a corpus is utilised by
searching
and rendering its material, for example, by showing keywords in context that 
match a given criterion or by showing aligned multilingual texts side by side. 
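
The keyword-in-context display mentioned above can be sketched in a few lines
of Python. This is a minimal illustration that works on a plain string rather
than on SGML-encoded corpus data; the function name kwic and its width
parameter are assumptions made for the example.

```python
import re

def kwic(text, pattern, width=30):
    """Return one concordance line per match of the regular expression.

    Each line shows up to `width` characters of left context, the match in
    square brackets, and up to `width` characters of right context.
    """
    lines = []
    for match in re.finditer(pattern, text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines
```

For example, kwic(text, r"\bcorpus\b", width=20) lists every occurrence of
the word "corpus" with its immediate context, aligned on the keyword.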

The Perl language is well suited for up-translation to SGML as well as most
(non-linguistic) conversions of SGML documents; there also exists an
SGML-aware Perl library, perlSGML, written by Earl Hood. Another general
purpose programming language which is in the public domain and particularly
suited to the manipulation of character strings is Icon (Griswold & Griswold,
1990). It was developed at the University of Arizona, and was extensively
used in the British National Corpus project.
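
To make the notion of up-translation concrete, here is a minimal sketch,
written in Python rather than Perl or Icon, that turns a plain ASCII text
with blank-line separated paragraphs into SGML-style markup. The element
names are merely TEI-like and have not been checked against any DTD, and the
function names are invented for this example.

```python
import re

def escape_sgml(text):
    """Replace characters that would otherwise be read as SGML markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def up_translate(plain):
    """Wrap blank-line separated paragraphs of plain text in <p> elements."""
    blocks = re.split(r"\n\s*\n", plain)
    # Collapse line breaks and runs of spaces inside each paragraph.
    paragraphs = [" ".join(block.split()) for block in blocks if block.strip()]
    out = ["<text>", "<body>"]
    out += [f"<p>{escape_sgml(paragraph)}</p>" for paragraph in paragraphs]
    out += ["</body>", "</text>"]
    return "\n".join(out)
```

A real conversion would also have to supply the bibliographic header and
deal with the idiosyncrasies of the legacy format, which is where most of the
effort usually goes.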

The basic SGML tool is the validating parser, which checks the syntactic
well-formedness of SGML documents and reports errors if a document is
not well-formed. A number of such validators exist, quite a few of which are
in the public domain, e.g., James Clark’s sgmls and sp. The second essential 
“tool” that is needed is an SGML-aware editor. Most of these are commercial 
software; however, Emacs does have a special mode (psgml), meant for 
editing SGML files. 

There also exist freely available programs specifically designed for
conversion of SGML documents (although not designed for handling corpus data),
for example, the Copenhagen SGML tool CoST and MID’s MetaMorphosis. 
These allow for transformations of SGML documents and also for rendering 
SGML annotated data. These tools and others can be found in the various 
Internet SGML repositories. 

While there are quite a few tools available for corpus development, the 
choice of corpus querying tools is much more limited. While some of the 
above tools might prove useful in designing such a system, an integrated 
corpus query system must combine speed, a powerful querying language, and 
a display engine. Some such systems have been developed for DOS, but they 
usually lacked support for non-English languages and relied on idiosyncratic 
corpus encoding schemes. One Unix system that is offered for research
purposes is the Corpus Query System cqp/Xkwic by Stuttgart’s Institut für
Maschinelle Sprachverarbeitung (IMS). The corpus query processor cqp is a
command-language based query interpreter, which can be used independently
or through Xkwic, an X Window graphical user interface.

The last part of this section mentions some statistical linguistic tools used 
for corpus annotation. Part-of-speech taggers take as their input a word-form 
together with all its possible morphosyntactic interpretations and output its
most likely interpretation, given the context in which the word-form appears. 
So, for example, the word-form “tags” by itself can be interpreted as the 
plural noun or as the third person singular verb, whereas in the context “it 
tags word-forms” only the verb interpretation is correct. While syntactic 
parsers also perform such disambiguation, pure rule-based approaches tend to 
have low coverage and speed, and the investment into building rulesets for a 
particular language is prohibitive. 

Recently there has been an increased interest in statistically based
part-of-speech taggers, which use the local context of a word form for
morphosyntactic disambiguation. Such taggers have the advantage of being fast
and can be automatically trained on a pretagged corpus. Their success rate
depends on many factors, but is usually at or below 96%. Two of the better
known taggers in the public domain are the Markov model-based Xerox tagger,
written in Lisp (Cutting et al., 1992), and Brill’s rule-based tagger, written
in C (Brill, 1992).
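
How local context can decide between competing interpretations may be
sketched with a small bigram Markov model tagger trained on a pretagged
corpus and decoded with the Viterbi algorithm. This is an illustration only
and is not the method of either tagger named above; the toy corpus, the tag
names and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Collect tag-bigram and word-given-tag counts from a pretagged corpus."""
    transitions = defaultdict(Counter)   # previous tag -> counts of next tag
    emissions = defaultdict(Counter)     # tag -> counts of word forms
    for sentence in tagged_sentences:
        previous = "<s>"                 # sentence-start pseudo tag
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word.lower()] += 1
            previous = tag
    return transitions, emissions

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, transitions, emissions):
    """Return the most probable tag sequence for the list of word forms."""
    tags = sorted(emissions)
    vocabulary = {word for counts in emissions.values() for word in counts}
    n_words = len(vocabulary) + 1        # one extra slot for unknown words
    best = {"<s>": (0.0, [])}            # tag -> (log score, best path so far)
    for word in words:
        word = word.lower()
        new_best = {}
        for tag in tags:
            emit = log_prob(emissions[tag], word, n_words)
            score, path = max(
                (prev_score
                 + log_prob(transitions.get(prev, Counter()), tag, len(tags))
                 + emit,
                 prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values())[1]

# Hypothetical pretagged training data.
toy_corpus = [
    [("it", "PRON"), ("tags", "VERB"), ("word-forms", "NOUN")],
    [("she", "PRON"), ("reads", "VERB"), ("tags", "NOUN")],
    [("the", "DET"), ("tags", "NOUN"), ("help", "VERB")],
    [("it", "PRON"), ("reads", "VERB"), ("the", "DET"), ("tags", "NOUN")],
]
```

In this toy corpus "tags" occurs more often as a noun than as a verb, so a
tagger that looked at the word alone would choose the noun reading; it is the
transition from the preceding pronoun that favours the verb reading in "it
tags word-forms".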

Finally, another purely statistical tool is the Gale and Church aligner
(Gale & Church, 1993), which sentence-aligns a text with its translation. It
produces surprisingly good results by very simple means as it incorporates no 
linguistic knowledge but makes use of the basic insight that a text and its 
translation will have roughly the same number of characters. 
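
The length-based idea can be sketched as a dynamic programming search over
possible pairings ("beads") of sentences, where pairings whose character
lengths differ greatly are penalised. The sketch below is a simplification
under stated assumptions, not the published algorithm: the bead types, the
prior probabilities and the variance constant are illustrative values chosen
for this example, and the function names are invented.

```python
import math

# Assumed prior probability of each bead type:
# (number of source sentences, number of target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01, (2, 1): 0.045, (1, 2): 0.045}
VARIANCE = 6.8   # assumed variance of the length difference per character

def bead_cost(src_len, tgt_len, prior):
    """Negative log probability of pairing src_len with tgt_len characters."""
    mean = (src_len + tgt_len) / 2
    if mean == 0:
        return -math.log(prior)
    delta = (tgt_len - src_len) / math.sqrt(mean * VARIANCE)
    two_tailed = math.erfc(abs(delta) / math.sqrt(2))   # P(|Z| >= |delta|)
    return -math.log(max(two_tailed, 1e-300)) - math.log(prior)

def align(source, target):
    """Return the cheapest list of (source sentences, target sentences) beads."""
    n, m = len(source), len(target)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in source[i:ni])
                tgt_len = sum(len(t) for t in target[j:nj])
                candidate = cost[i][j] + bead_cost(src_len, tgt_len, prior)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Follow the back pointers from the end of both texts to the start.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((source[pi:i], target[pj:j]))
        i, j = pi, pj
    return beads[::-1]
```

Note that the sketch never looks at the words themselves: only the character
counts of the sentences enter the cost function.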



5. Computational Linguistic Tools

This section deals with software that belongs to computational linguistics 
proper and includes morphological analysers, implementations of formalisms, 
and lexicon development environments. These systems can hardly be
considered “tools” as they are often large and complex. They are, furthermore,
only distantly connected to corpus development or exploitation. Nevertheless, 
they provide an environment for advanced language engineering tasks (e.g., 
machine translation) and it would be remiss not to mention them. 

For morphological analysis and synthesis, Koskenniemi’s finite-state
two-level model is by far the most widely used and investigated. It is
primarily meant to deal with spelling changes at or near morpheme boundaries.
The best known implementation is probably PC-KIMMO (Antworth, 1990),
although a number of other implementations also exist. Information about
them, as well as about other systems not based on the two-level model, is
available from the DFKI Natural Language Software Registry in Saarbrücken.
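
The kind of spelling change at a morpheme boundary that such systems handle
can be illustrated with a deliberately simplified sketch. It uses ordered
regular expression rewrites for synthesis only, which is a stand-in
technique and not the two-level model itself (where rules are parallel
constraints compiled into finite-state transducers that also run in the
analysis direction). The two English rules and the "+" boundary symbol are
assumptions made for the example.

```python
import re

# Each rule rewrites a lexical form into a surface form; "+" marks a
# morpheme boundary. Only the first matching rule is applied.
RULES = [
    (re.compile(r"([^aeiou])y\+s$"), r"\1ies"),      # fly+s    -> flies
    (re.compile(r"(ch|sh|s|x|z)\+s$"), r"\1es"),     # fox+s    -> foxes
]

def generate(lexical):
    """Map a lexical form such as 'fox+s' to its surface spelling."""
    for pattern, replacement in RULES:
        lexical, substitutions = pattern.subn(replacement, lexical)
        if substitutions:
            break
    # Any remaining boundary symbol has no surface realisation.
    return lexical.replace("+", "")
```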

For general lexical structuring, including, but not limited to, morphological
dependencies, a simple yet powerful and efficient language is DATR
(Evans & Gazdar, 1990). DATR is a lexical knowledge representation language
in which it is possible to define networks allowing multiple default
inheritance. The original Sussex version, which is publicly available, is
implemented in Prolog.
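
The idea of a network with default inheritance can be sketched as follows.
The sketch does not use DATR syntax and covers only a small part of what the
language offers; the node names, attribute names and dictionary layout are
invented for the example. Each node states some values locally, inherits
everything else from a default parent, and may name a different parent for
particular attributes.

```python
# Hypothetical lexical network: local values override inherited defaults.
NETWORK = {
    "VERB":  {"values": {"cat": "verb", "past": "-ed"}},
    "NOUN":  {"values": {"cat": "noun", "plural": "-s"}},
    "Walk":  {"default": "VERB", "values": {"root": "walk"}},
    "Sleep": {"default": "VERB", "values": {"root": "sleep", "past": "slept"}},
    # Inherits from VERB by default, but takes its plural from NOUN.
    "Talk":  {"default": "VERB", "inherit": {"plural": "NOUN"},
              "values": {"root": "talk"}},
}

def lookup(network, node, attribute):
    """Return the value of attribute at node, falling back on inherited defaults.

    Assumes the inheritance links contain no cycles.
    """
    while node is not None:
        entry = network[node]
        if attribute in entry["values"]:
            return entry["values"][attribute]
        # An attribute-specific parent takes precedence over the default parent.
        node = entry.get("inherit", {}).get(attribute, entry.get("default"))
    raise KeyError(attribute)
```

Here lookup(NETWORK, "Walk", "past") falls back on the regular suffix stated
at VERB, whereas the entry for Sleep overrides that default with its own
irregular form.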

Syntactic parsing, which usually forms the basis of more advanced language
engineering applications, has probably been the subject of most research in
computational linguistics. It is, therefore, not surprising that a host of
(public domain) parsing programs are available. For example, Prolog
implementations usually offer a Definite Clause Grammar (DCG) module, and a
number of parsers (chart, Tomita, etc.) are available via the Internet, e.g.,
via DFKI.

Apart from DCG, the best known unification-based context-free parser is
the PATR system (Shieber et al., 1983). More recent unification-based systems
have replaced the untyped feature structures of PATR with typed ones, thus
conferring the benefits of type checking and type inheritance on their
grammars. Given that these systems can be used for purposes other than
parsing (e.g., machine translation), they are better classified as
implementations of linguistic formalisms. There are a number of such systems
available, pointers to which can, again, be found at the DFKI Web page. Here
we will mention only three of the better known ones. The Attribute Logic
Engine (ALE) (Carpenter & Penn, 1994) is written in Prolog and incorporates a
chart parser and lexical rules. It is optimised for speed of processing which,
however, makes it less than ideal as a grammar development environment. IMS
offers two systems: Comprehensive Unification Formalism (CUF)
(Dörre & Eisele, 1991), written in Prolog, and Typed Feature Formalism (TFS)
(Zajac, 1992), written in Lisp. CUF especially offers a very powerful grammar
development environment; for a detailed comparison between ALE, CUF, and
TFS, see (Manandhar, 1993). Finally, it should be noted that of the three
systems, ALE is available in source code, the other two being distributed in
their compiled version only.
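
The operation at the heart of all these systems, unification of feature
structures, can be sketched for the untyped case as follows. Feature
structures are represented here as nested Python dictionaries with atomic
values at the leaves; this representation, and the omission of reentrancy
(structure sharing) and of types, are simplifications made for the example
and do not reflect how any of the systems above is implemented.

```python
def unify(a, b):
    """Unify two untyped feature structures; return None if they clash.

    Complex structures are dicts from feature names to values; atomic
    values unify only with an equal atomic value.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                unified = unify(result[feature], value)
                if unified is None:
                    return None          # conflicting information
                result[feature] = unified
            else:
                result[feature] = value  # information present only in b
        return result
    return a if a == b else None
```

Unifying an NP specified only for singular number with a structure that
requires third person thus yields a structure carrying both pieces of
information, while unifying singular with plural fails.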



6. Drawbacks and Prospects 

The penalties of using public domain tools should not be underestimated:
the tools often do not come with all the bugs ironed out, with detailed
documentation, or with the exact functionality required on the platform at
hand. Maintenance is also often lacking, as the developers’ interests may
have turned to other areas, and support is, of course, a voluntary effort and
cannot be counted on. For the field of multilingual language engineering, an
especially serious problem is that most of the available linguistic engineering
software (public domain or commercial) to date was written for the English
language or at best for (major) European Union (EU) languages. This bias
gives rise to problems with “foreign” character sets, collating sequences, and
the format of dates, numbers, and the like.

A connected problem is the lack of standards or, in some cases, conflicting
standards for software development. This gives rise to tools that are often
incompatible with one another, e.g., by virtue of having different input/output
formats and protocols. This can make their integration a daunting task,
requiring extensive modifications of the tools. This is not to say that
standards concerning software development have not been developed or are not
being considered by various institutions, e.g., by ISO and FSF. The work that
is specifically addressed towards linguistic engineering is the Guidelines for
Linguistic Software Development, which are being produced as a joint effort of
the EU-sponsored MULTEXT and MULTEXT-East projects and the Eagles
sub-group on Tools, established in spring 1995. In particular, these guidelines
are to address questions of usability, portability, compatibility and
extensibility of linguistic software, concentrating in the first place on the
Unix environment.

A number of other European Union projects have been concerned with 
developing linguistic software. However, in most cases the produced software 
is proprietary and, hence, not publicly available. A notable exception is the 
MULTEXT(-East) (this volume) project, which aims to make freely available 
to the academic community a number of SGML-based corpus processing 
tools. These include re-implementations of the already mentioned Xerox
tagger, the Gale & Church aligner, and a morphological synthesiser based on
the two-level model.

For obtaining the tools mentioned in this paper, as well as a host of others,
the easiest way is via the Web. A number of Web sites that provide further
pointers to resources of interest to linguistic engineering already exist; some
of them are listed below. While some are the product of voluntary effort by
individuals, there are also official bodies that disseminate information via the
Internet, e.g., the DFKI Natural Language Software Registry or the EU
Relator project. In connection with this, the pioneering effort of Edinburgh’s
Language Technology Group should also be mentioned: LTG offers a
Language Software Helpdesk, which is a free service dedicated to the support of
public domain and freely available software for natural language processing
and the fostering of its use in practical applications.

Finally, the TELRI Concerted Action also has a working group on
“Lingware Dissemination”. Its purpose is to increase the availability of
language engineering tools by making available, via the Web, information on
extant tools, by providing the public tools of TELRI partners, and by
improving such tools by adapting them to various languages and platforms.



7. WWW References 

The following World-Wide Web pages can provide more information on 
many of the topics and tools introduced above: 

SGML Web Page by Robin Cover and SGML Open:
http://www.sil.org/sgml/sgml.html
http://www.sgmlopen.org/

TEI home page:
http://www-tei.uic.edu/orgs/tei/

British National Corpus and Linguistic Data Consortium:
http://info.ox.ac.uk/bnc/
http://www.cis.upenn.edu/ldc/

GNU software ftp site (including Emacs, Perl, grep, etc.) and online
documentation for GNU software:
ftp://prep.ai.mit.edu/pub/gnu/
http://www.ns.utk.edu/gnu/

SGML repository at Institute for Informatics, Oslo (including psgml, sgmls
and sp) and Steve Pepper’s Whirlwind Guide to SGML Tools:
ftp://ftp.ifi.uio.no/pub/SGML/
http://www.falch.no/people/pepper/sgmltool/

Taggers by Xerox and Brill:
ftp://parcftp.xerox.com/pub/tagger/
ftp://blaze.cs.jhu.edu/pub/brill/Programs/

DFKI Natural Language Software Registry and the IMS list of language
engineering links:
http://cl-www.dfki.uni-sb.de/cl/registry/draft.html
http://www.ims.uni-stuttgart.de/info/FTPServer.html

SIL’s Software (including PC-PATR, PC-KIMMO):
http://www.sil.org/computing/sil_computing.html

DATR ftp site:
ftp://ftp.cogs.sussex.ac.uk/pub/nlp/DATR/

ALE home page:
http://macduff.andrew.cmu.edu/ale/

IMS’s Tools and resources, including CQP/Xkwic, CUF and TFS:
http://www.ims.uni-stuttgart.de/Tools/ToolsAndResources.html

LTG’s Language Software Helpdesk:
http://www.ltg.hcrc.ed.ac.uk/projects/helpdesk/

MULTEXT with Eagles Guidelines for Linguistic Software Development and
MULTEXT-East:
http://www.lpl.univ-aix.fr/projects/multext/
http://www.lpl.univ-aix.fr/projects/multext-east/

Eagles and Relator:
http://www.ilc.pi.cnr.it/EAGLES/home.html
http://www.de.relator.research.ec.org:80/lg=en/index.mlhtml

TELRI and the Web version of this article:
http://www.ids-mannheim.de/telri/telri.html
http://nl.ijs.si/telri-wg5/pub-tools/